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RENEGAR’S CONDITION NUMBER, SHARPNESS 
AND COMPRESSED SENSING PERFORMANCE 

VINCENT ROULET, NICOLAS BOUMAL, AND ALEXANDRE D’ASPREMONT 


Abstract. We show that several quantities controlling compressed sensing performance also directly control 
algorithmic complexity. We describe linearly convergent restart schemes solving a broad range of compressed 
sensing problems using first-order methods. The key term controlling convergence measures the sharpness of 
the optimum and can be interpreted as a condition number, computed as the ratio between the true signal spar¬ 
sity and the maximum signal size that can be recovered by the observation matrix. In a similar vein, Renegar’s 
condition number is a data-driven computational complexity measure for convex programs, generalizing clas¬ 
sical condition numbers for linear systems. We provide evidence that for a broad class of compressed sensing 
problems, the worst case value of this algorithmic complexity measure taken over all signals matches the re¬ 
stricted eigenvalue of the observation matrix, which controls compressed sensing performance. This condition 
number also measures the robustness of the recovered solution with respect to a misspecification of the obser¬ 
vation matrix A, a point rarely addressed by classical recovery results. Overall, this means that, in compressed 
sensing problems, a single parameter directly controls computational complexity and recovery performance. 


1. Introduction 

Several recent results have highlighted a clear tradeoff between computational complexity on one side, 
and statistical performance on the other (i.e., the number of samples required to recover the signal). We 
focus on sparse recovery problems written 

minimize ||x|| 
subject to Ax = b 

in the variable x G where A G is a sensing matrix and 6 G is the vector of observations. Here, 
II • II is a sparsity inducing norm (e.g., £i) whose properties will be specified below. Donoho and Tanner 
[2005] and Candes and Tao [2006] have shown that, for certain matrices A, when the observations y are 
generated by a sparse signal, i.e., when b = Ax* and Card(x*) = k so the signal is sparse, 0{klogp) 
observations suffice for stable recovery of x* by solving problem (1) with the ii norm, in which case (1) 
is a linear program. These results have been generalized to many other recovery problems with various 
assumptions on the signal structure (e.g., where x is a block-sparse vector, a low-rank matrix, etc.) and a 
library of corresponding convex relaxations has been developed to recover these more complex structures. 

Many algorithms have also been developed to solve compress sensing problems at scale. Besides spe¬ 
cialized methods such as LARS [Efron et ah, 2004], the classical FISTA [Beck and Teboulle, 2009] and 
NESTA [Becker et ah, 201 la] solvers use accelerated gradient methods to solve LASSO problems, with effi¬ 
cient and flexible implementations covering a much wider range of compressed sensing instances developed 
in e.g. [Becker et ah, 2011b]. Several authors have also studied restart schemes and ODE interpretations 
[O’Donoghue and Candes, 2015; Su et ah, 2014; Giselsson and Boyd, 2014] to speed up convergence in this 
context. More recently, linear convergence results have been obtained for recovery problems, with [Agarwal 
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et al., 2011; Yen et al., 2014; Zhou et ah, 2015] showing linear convergence of some first-order methods 
using variants of the strong convexity assumption. 

In sparse recovery problems, statistical performance is usually measured in terms of the number of sam¬ 
ples required to guarantee stable recovery, while computational performance is usually measured in terms 
of classical bounds on the computational cost of the corresponding convex optimization problems or M- 
estimators. Early on, it was noticed, for example in [Donoho and Tsaig, 2008], that recovery problems which 
are easier to solve from a statistical point of view (i.e., where more samples are available), are also easier 
to solve numerically. The results in [Donoho and Tsaig, 2008] focused on homotopy methods and were 
essentially empirical. More recently, the authors of [Chandrasekaran and Jordan, 2013; Amelunxen et ah, 
2014] studied computational and statistical tradeoffs for increasingly tight convex relaxations of shrinkage 
estimators. They show that recovery performance is directly linked to the Gaussian squared-complexity of 
the tangent cone with respect to the constraint set and study the complexity of several convex relaxations. In 
[Chandrasekaran and Jordan, 2013; Amelunxen et ah, 2014] the structure of the convex relaxation is varying 
and affecting both complexity and recovery performance, while in [Donoho and Tsaig, 2008] and in what 
follows, the structure of the relaxation is fixed, but the data (i.e., the observation matrix A) varies. 

Here, following results in [Roulet and d’Aspremont, 2017], we first describe linearly convergent restart 
schemes solving a broad range of compressed sensing problems using first-order methods. The key term 
controlling convergence measures the sharpness of the optimum can be interpreted as a condition number, 
computed as the ratio between the true signal sparsity and the maximum signal size that can be recovered 
by the observation matrix. 

In a similar spirit, we show that the cone restricted eigenvalues introduced in [Bickel et ah, 2009] corre¬ 
spond to the worst-case value of Renegar’s condition number for problem ( 1 ) taken over a class of signals xq. 
This means that a single quantity drives both the complexity of solving problem (1) and its recovery per¬ 
formance, i.e., the number of samples required for exact recovery, and the solution’s robustness when the 
observations y are noisy. This same condition number also controls the impact of misspecification in A 
on the optimal solution to (1). From a compressed sensing perspective, this confirms that obtaining more 
samples also makes the reconstructed solution more robust to experimental uncertainty in A. 

2. Sharpness & Lower Bounds 

In what follows, we show that Nullspace Property conditions (see e.g. [Cohen et ah, 2009]) produce 
sharpness results on the optimum. In particular, in the £i setting, we show that if x solves the sparse 
recovery problem (1), then 

2-C„ ... 

———||a: — x||i < ||x||i — ||a:||i 
(_/ 

for any x such that Ax = b. This sharpness bound on the optimum will allow us to produce restart schemes 
accelerating the performance of classical optimization algorithms. Furthermore, the constant C controlling 
acceleration depends explicitly on both the sparsity of the solution and on the recovery threshold of the 
observation matrix A. This directly links quantities controlling sparse recovery performance with measures 
of computational complexity. For simplicity, we start by describing the £i setting, we then generalize our 
results to other sparsity inducing norms. 

2.1. The li setting. We first briefly recall key results on sparse recovery using the £i norm, then use these 
results to produce sharpness bounds on the recovery problem. 

Definition 2.1. (Nullspace Property [Cohen et al., 2009]) The matrix A satisfies the Nullspace Property 
(NSP) of order k with constant C > 1 iff 

||a^||i < C\\xt<= 

for any x G AA(A) and T C [l,p] with Card(T) < k. 
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1 ) 


(NSP) 



Given a matrix A G and observations b = Ax* on a signal x* G recovery is performed by 

solving the following ii minimization program 


minimize ||a:||i 
subject to Ax = b 


( 2 ) 


in the variable x G We call x the optimal solution of this problem. The nullspace property means this 
convex program recovers all signals up to some sparsity threshold on these sparse recovery problems. 

Proposition 2.2. Given a coding matrix A G satisfying the Nullspace Property (NSP) at order k with 
constant 1 < C < 2, then sparse recovery x = x* is guaranteed if Card{x*) < k, and 

2C 


\x — X 1 < 


inf 


\u — X 1 


2 — C {Cardii<A:} 
where x solves the ii-minimization LP and x* is the true signal. 

Proof, (see e.g. Cohen et al. [2009] Th. 4.3). If A satisfies the NSP at order 2k with constant C, then 

ll^^lll < C\\zT<^\\i 

for any z G Af{A) and T C [l,p] with Card(r) < 2k, means 

C „ „ 


kill > 


NtIIi- 


C- 1' 

Now let T = supp(x*) and let x x* such that Ax = b, so z = x — x* satisfies Az = 0, then 

Iklli = Ikr +-^rlli + IkT'^lli 

> Ikrlli “ Ikrili + IkTHli 

= ||x*||i + Iklli - 2||z'r||i 

and C <2 means 


l^lli — 2||2;'r||i > Iklli — 


C 


kj'iii > 0 


C- 1' 

hence ||x||i > ||x*||i, so x = x*. The error bound follows from similar arguments. ■ 

We can use these last results to bound suboptimality using the distance to the optimal set. We get the 
following proposition bounding the sharpness of the optimum of problem (1). 

Proposition 2.3. Given a coding matrix A G satisfying the Nullspace Property (NSP) at order k with 
constant 1 < C < 2. Let x be the solution of program (2) for b = Ax* with Card(x*) < k. Let x G 
satisfy Ax = b, we have 

. 2-C„ 

kill — ||x||i > —— —||x —x||i (Sharp) 


C 


with X = X*. 


Proof. The hypotheses of Proposition (2.2) are satisfied sox = x*,z = x — xG AA(j4) and following 
fhe proof of that proposition we get 


and 


yields the desired result. 


kill “ Iklli ^ Iklli “ 2|k7’||i 
c „ „ 


kill > 


c-1 


GT\\i 


The nullspace property (NSP) ensures that there are no (approximately) sparse vectors in the nullspace of 
the observation matrix A. We can give a more concrete geometric meaning to the constant C in (NSP) by 
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connecting it with the diameter of a section of the fi ball by the nullspace of the matrix A (see e.g. Kashin 
and Temlyakov [2007] for more details). 

Lemma 2.4. Suppose A G satisfies 

^ diam(i?^ n AA(A)) = sup \\x\\2 < 

2 Ax=0 

\\ x \\ i<l 


then A satisfies (NSP) at order kx with constant 

C = 


1 - y/kT/kj:, 

provided kx < k^. 

Proof. For any x G Af{A) and support set T with Card(T) < kx, 

||xr||i < s/kr lla^lb < s/kx/kx ||a;||i, 


( 3 ) 


which means 

Ikr'^lli > (1 — s/kT/kD)\\x\\i 

hence the desired result. ■ 


Precise estimates of the diameter of random sections of norm balls can be computed using classical 
results in geometric functional analysis. The low M* estimates in [Pajor and Tomczak-Jaegermann, 1986] 
for example show that when E is random subspace of codimension k (e.g. the nullspace of a random 
matrix A), then 

diam(Brn£:)<cyh?|M 

with high probability, where c > 0 is an absolute constant. In this case, Proposition 2.3 means that if x 
solves the recovery problem in (1), then (Sharp) reads 


> ^^/ ^rd(r)log(n/fcy ^ 


( 4 ) 


where T is the support of the true signal x*. This means that the sharpness of the optimum of (1) is 
essentially controlled by the ratio of the true signal size Card(r) with the maximum signal size k that can 
be recovered w.h.p. by the observation matrix A. 


2.2. General sparsity inducing norms. We now generalize the results above using the notion of sparsity 
structures introduced by Juditsky et al. [2014], which allows a common treatment of popular norms such 
as the ii norm, group-^i norms and the nuclear norm. Sparsity structures define sparsity (or simplicity) of 
signals through projectors. 

We begin by briefly recalling the setting in [Juditsky et ah, 2014]. Consider X and £, two Euclidean 
spaces, and a map B: X ^ £. In most cases, notably including the ii and nuclear norm, X = £ and 

we may think of B as the identity map, but it is useful to consider more general B to model group norms 

as well. In this setting, the problem under consideration is that of finding a sparse representation Bx of a 
signal, given noisy observations y = Ax. 

Definition 2.5. (Sparsity structure [Juditsky et al., 2014]) A sparsity structure on £ is defined as a norm 
II • II on £, together with a family V of linear maps of £ into itself satisfying three assumptions: 

(1) Every P £V is a projector, = P, 

(2) Every P ^ P is assigned a weight v{P) > 0 and a linear map P on £ such that PP = 0, 

(3) Eor any P € V and f,g £ £, one has 

\\p*f p < max(||/||*, IIpII*), 
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where || • ||* is the dual norm of\\ • || and P* is the conjugate mapping of the linear map P. 

The last condition in Definition 2.5 is arguably the least intuitive and Lemma 4.1 connects it, in some 
cases, with the more intuitive notion of decomposable norm. For k>Q, let 

Vk = {P£V : v{P) < k}. 

The notion of sparsity is defined as follows: a vector w is said to be fe-sparse if there exists P £ P^ such 
that Pw = w. A signal x is said to be /c-sparse if its representation Bx is /c-sparse. 

ii norm. In the the ii norm case, the sparsity structure is defined over S = JV = the map B reduces to 
the identity, and V is the set of projectors on coordinate subspaces of that is, V contains all projectors 
which zero out all coordinates of a vector except for a subset of them, which are left unaffected. The 
companion maps are the complementary projectors: P = I—P. Naturally, the complexity level corresponds 
to the number of coordinates preserved by P, i.e., z^(F’) = Rank(P). These definitions recover the usual 
notion of sparsity. 

Nuclear norm. The nuclear norm is defined for matrices X £ with singular values ai{X) as ||X|| = 

^ sparsity system by setting X = £ = B = L Its associated 

family of linear maps is 

P: X Tjeft^fPright, 
and 

P: Fright), 

where Fieft £ and Fright £ are orthogonal projectors. Their weights are defined as z^(F) = 

max (Rank(Fieft), Rank(Fright)) defining therefore fe-sparse matrices as matrices of rank at most k. 

Definition 2.5 allows us to revisit all the results of Section 2.1. This is essentially a direct generalization, 
with the caveat that since we do not assume F + F = I (which roughly corresponds to decomposable 
norms), the recovery conditions in [Juditsky et ah, 2014] that we recall below differs slightly from (NSP). 
In the setting discussed above, \&ix £ X solve the following optimization problem 

minimize ||Fx|| 

subject to Ax = b ^ 

in the variable x £ X. [Juditsky et ah, 2014] then show a slightly more general version of the following 
stable recovery result. 


Proposition 2.6. Suppose x £ X solves the recovery problem (5) with observations b = Ax* where x* is 
the true signal, up to a precision e, and that the true signal x is nearly sparse, i.e. there exists P £ Vk such 
that 


\\{I-P)Bx\\ <4 


Assume also that the following condition holds 

||FFz|| — ||FF 2 ;|| + \\Bz\\ < ||Fz||-p, 


for any z £ X, P £ Vk, then 

\\Bx - Bx* 

where || • || is the norm defined in Def. 2.5. 

Proof. See [Juditsky et ah, 2014, Prop. 3.1]. ■ 


e 4“ 26 x 

< — - - 

1-7 


( 6 ) 

(V) 


The condition in (6) is slightly stronger than the classical nullspace property but ensures stable recovery 
when problem (5) is only solved approximately. This result also allows us to produce a direct generalization 
of Proposition 2.3, as follows. 
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Proposition 2.7. Given a sparsity system V satisfying assumption ( 6 ) in Proposition 2.6, let x be the solution 
of program (5) for b = Ax* with Card(x*) < k. If x G X satisfies Ax = b, then 

||3:||i — ll^lli > (1 — 7)lk — i;||i (Sharp-Gen) 


with X = X*. 


Proof. If V satisfies assumption ( 6 ) in Proposition 2.6, then [Juditsky et ah, 2014, Prop. 3.1] shows that 
X = X*. For z G AA( 74 ) we have 

\\PBz\\ - \\PBz\\ > {l-y)\\Bz\\ 

which combined with [Juditsky et ah, 2014, Lem. 3.1] and the fact that Ax = b yields the desired result. ■ 


This last bound is a direct generalization of the sharpness result in Proposition 2.3, with (Sharp-Gen) 
extending the inequality in (Sharp). While the coefficient (1 — 7 ) in (Sharp-Gen) is perhaps less intuitive 
than the condition number in (Sharp), we observe that here too this same coefficient controls both recovery 
stability in the error bound (7) of Prop. 2.6, and sharpness (hence computational complexity as we will see 
below) in bound (Sharp-Gen). 


2.3. Restarting First-Order Methods. In this section we seek to solve the recovery problem (1) assuming 
that the sharpness bounds (Sharp) hold. The NESTA method detailed in [Becker et ah, 2011a] uses the 
smoothing argument in [Nesterov, 2005] to solve (1). In practice, this means using the optimal algorithm in 
[Nesterov, 1983] to minimize 

f^{x)= sup u'^x-p\\u\\l/2 

II Tt|| 00 ^ 1 


for some p > 0, which approximates the £i norm uniformly up to pp/2. This is the classical Huber function, 
which has a Lipschitz continuous gradient with constant equal to 1/p. Starting at a point xq, t iterations of 
the optimal algorithm in [Nesterov, 1983] will then yield a point xt satisfying 

2 ||xo-x*"^ 


1 - FI < 


pP 


Jl 

2 


and the optimal bound is reached for p = \/ 2 ||xo — x* \\ 2 /{t^/p) and reads 


Ft T - 


3Vp|ko - 3:*||2 
t 


( 8 ) 


As in [Roulet and d’Aspremont, 2017], we write A{xq, t) = xt the output of this algorithm and describe a 
restart scheme exploiting the sharpness result in (Sharp) to improve the computational complexity of solving 
problem (1). In fact, when Proposition 2.3 holds, combining (9) and (Sharp) yields 


\xt\\i - ||x||i < 


^Vp\\xo - X*\\2 


< 


3^C 

t{2-C)' 


kolli - ll^lli) 


hence if we pick t such that 

t{2-C) 

the simple, constant restart scheme running summarized in Algorithm 1 . After r outer iterations, hence a 
total of N = tr inner iterations, the algorithm will produce a point Hr satisfying 


l|y||i - 


\t{2-C)) 


minimizing this last bound in t yields the following proposition. 
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Proposition 2.8. Given a coding matrix A G satisfying the Nullspace Property (NSP) at order 2k 

with constant C <2. Let x be the solution of program {2) for b = Ax* with Card(x*) < k. After running 
a total of N inner iterations in Algorithm I, we get a point y € ML such that 


II2/II1 - 


l^lli < exp 


Ar(2-(7)\ 

3e^C ) 


(9) 


for t = 3ey/pCI{2 — C), hence N/t restarts. 


Recall from (3) that for random observations, we have 

—— = 1 - 2^kT/kD 

where kx is roughly the largest signal size that can be recovered by the observations A and kx is the size 
of the true signal. This means that the complexity of the optimization problem (1) decreases with the 
complexity of the statistical recovery problem, with both quantities being controlled by the oversampling 
ratio kx)/kx- 


Algorithm 1 Restart Scheme 

Input: Initial point yo G ML 
For i = 1... ,T compute 

yi = A{yi-i,t) (Restart) 

Output: A point pr approximately solving (1). 


3. RENEGAR’S condition number & RESTRICTED EIGENVALUES 

Renegar’s condition number is a data-driven computational complexity measure for convex programs, 
generalizing classical condition numbers for linear systems. In what follows, we show that for a broad class 
of compressed sensing problems, the worst case value of this algorithmic complexity measure taken over 
all signals matches the restricted eigenvalue of the observation matrix, which controls compressed sensing 
performance. 

3.1. Computational complexity. We begin by addressing computational complexity aspects of problem (1). 
Computational complexity for convex optimization problems is often described in terms of polynomial func¬ 
tions of the problem size. This produces a clear link between problem structure and computational complex¬ 
ity but fails to account for the nature of the data. If we use linear systems as a basic example, unstructured 
linear systems of dimension n can be solved with complexity O(n^) regardless of the matrix values, but 
iterative solvers will converge much faster on systems that are better conditioned. The seminal work of 
[Renegar, 1995a, 2001] extends this notion of conditioning to optimization problems, producing data-driven 
bounds on the complexity of solving conic programs, and showing that the number of outer iterations of 
interior point algorithms increases as the distance to ill-posedness decreases. 

In what follows, we study the complexity of the oracle certifying optimality of a candidate solution x* 
to (1) as a proxy for the problem of computing an optimal solution to this problem. As we will see below, 
certifying optimality means solving a pair of alternative conic linear systems of the form 

Ax = 0, X G C (10) 

and 

- A^b G C* 


1 


( 11 ) 








for a given cone C C Several references have connected Renegar’s condition number C{A) (which 
will be defined more precisely below) and the complexity of solving conic linear systems using various 
algorithms [Renegar, 1995a; Freund and Vera, 1999b; Epelman and Freund, 2000; Renegar, 2001; Vera 
et ah, 2007; Belloni et ah, 2009]. In particular, Vera et al. [Vera et ah, 2007] link C(^) to the complexity of 
solving the primal dual pair (lO)-(ll) using a barrier method. They show that the number of outer barrier 
method iterations grows as 

O C{A ))), 

where i>c is the barrier parameter, while the conditioning (hence the complexity) of the linear systems 
arising at each interior point iteration is controlled by C{A)‘^. This link was also tested empirically on 
linear programs using the NETLIB library of problems in [Ordonez and Ereund, 2003], where computing 
times and number of iterations were regressed against estimates of the condition number computed using 
the approximations for C{A) detailed in [Ereund and Vera, 2003]. 

Studying the complexity of computing an optimality certificate in (10) gives insights on the performance 
of oracle based optimization techniques such as the ellipsoid method. Of course, these abstract methods are 
very different from those used to solve problem ( 1 ) in pratice. However, we will observe in the numerical 
experiments of Section 4 that the condition number is strongly correlated with the empirical performance of 
efficient recovery algorithms such as EARS [Efron et ah, 2004] and Homotopy [Donoho and Tsaig, 2008; 
Asif and Romberg, 2014]. 

We now briefly recall optimality conditions for problem (1) and two equivalent constructions for the 
condition number of a conic linear system. Define the tangent cone at point x with respect to the norm || • ||, 
that is, the set of descent directions for the norm || • || at x, as 

T{x) = conejz : \\x + z\\ < ||x||}. (12) 

The simple lemma below characterizes unique optimal solutions to problem (1) in terms of T{x). 

Lemma 3.1. The point x* is the unique minimizer of (1) if and only if Null{A) n T{x*) = {0}. 

Proof. This follows from standard KKT conditions (see for example [Chandrasekaran et ah, 2012, 
Prop 2.1]). ■ 


In other words, x* is the unique optimizer if and only if the following problem is infeasible 

find z 

s.t. Az = 0 (P) 

2 ; G T{x*), z 0, 

in the variable ^ G MT. To certify feasibility, it is sufficient to exhibit a solution. One way of certifying 
infeasibility of (P) is to solve the dual problem 


find u 

s.t. —A^u G T(x*)°, M / 0, 


(D) 


in the variable u G M”, where T{x*)° is the polar cone of T{x*). Renegar’s condition number [Renegar, 
1995a,b; Pena, 2000] provides a data-driven measure of the complexity of this task. It is rooted in the 
sensible idea that certifying infeasibility is easiest if the problem is far from being feasible. Eormally, the 
distance to feasibility Px*{A) is defined as follows. Eet = {A G : (P) is infeasible}. Then, 

using the spectral norm as matrix norm, 

pf*(A) ^inf{||AA ||2 : A + AA^Aff*}. (13) 


Renegar’s condition number for problem (P) with respect to x* is then defined as the scale-invariant recip¬ 
rocal of this distance 


C,.(A) 


A 
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PII2 _ 

Px* (^) ’ 


(14) 



( 15 ) 


We can also define conically restricted minimal singular value of A as follows 


fj,^* (A) = inf 

z&T{x*) 


\ Z \\2 


Interestingly, this last quantity turns out to be equal to the distance to infeasibility and we have the following 
result. 


Lemma 3.2. Distance to feasibility and cone restricted eigenvalues match, i.e. (A) = px* (A)- 

Proof. When (P) is feasible, both vanish. Otherwise, see [Freund and Vera, 1999a, Th. 2], or simplified 
versions in [Belloni and Freund, 2009, Lem. 3.2] and [Amelunxen and Lotz, 2014]. ■ 


Notice that, if T{x*) were the whole space and if A^A were full-rank (never the case if n < p), then 

p{A) would be the smallest singular value of A. As a result, C{A) would reduce to the classical condition 
number of A (and to oo when AAA is rank-deficient). Renegar’s condition number is necessarily smaller 
(better) than the latter, as it further incorporates the notion that A need only be well-conditioned along those 
directions that matter with respect to the norm || • || at x*. Later, we will remove the dependence on x* by 
considering a worst-case condition number over classes of “simple” signals. 

When the primal problem (P) is feasible, so that px* (A) = 0, the condition number as defined here is 
infinite. While this correctly captures the fact that, in that regime, statistical recovery does not hold, it does 
not properly capture the fact that, when (P) is ’’comfortably” feasible, certifying so is easy, and algorithms 
terminate quickly (although they return a useless estimator). From both a statistical and a computational 
point of view, the truly delicate cases correspond to problem instances for which both (P) and (D) are only 
barely feasible or infeasible. This is illustrated in simple numerical example in [Boyd and Vandenberghe, 
2004, §11.4.3] and in our numerical experiments, corresponding to the peaks in the CPU time plots of 
the right column in Figure 3: problems where sparse recovery barely holds/fails are relatively harder. For 
simplicity, we only focused here on distance to feasibility for problem (P). However, it is possible to 
symmetrize the condition numbers used here as described in [Amelunxen and Lotz, 2014, §1.3], where a 
symmetric version of the condition number is defined as 


n{A) 


min 


Mil Mil 1 


This quantity peaks for programs that are nearly feasible/infeasible. Naturally, the condition number also 
controls the sensitivity of the solution to changes in the matrix A, with [Renegar, 1994, 1995b] for ex¬ 
ample directly bounding changes in the solution to (10) in terms of C{A) and changes A A in the system 
matrix. This means that C{A) also measures the robustness of the solution to the recovery with respect to 
misspecification of the observation matrix A, a point rarely addressed by classical recovery results. 


3.2. Statistical performance. We now focus on the link between condition number and the statistical per¬ 
formance of the solution of problem (1). To this end, assume now that the observations y are affected by 
noise and that we solve a robust version of problem (1), written 


minimize ||x|| 

subject to \\Ax — b \\2 < (5||^||2, 


(16) 


in the variable x G M^, with the same design matrix A G observations y G and noise level J > 0. 


3.2.1. Recovery bounds using C{A). The following classical result bounds the reconstruction error in terms 
ofC(A). 


Lemma 3.3. Suppose we observe y = Axq + w where ||ru ||2 < (J||A ||2 and let x* be an optimal solution of 
problem (16). We get the following error bound: 


* 


Xo\\2 < 2 


bxQ (^) 
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X 


26 -CxoiA). 


(17) 






Proof. We recall the short proof of [Chandrasekaran et ah, 2012, Prop. 2.2]. Both x* and xq are feasible 
for (16) and x* is optimal, so that ||x*|| < ||xo||. Thus, the error vector x* — xq is in the tangent cone 
T{xq) (12). By the triangle inequality, 

\\A{x* - xq )\\2 < \\Ax* - y \\2 + 11 ^ 2:0 - yh < 2(5||^||2. 

Furthermore, by definition of ^xq (15), 

||A(a:* - xo)||2 > \\x* - xoh- 

Combining the two concludes the proof. ■ 


Notice that, in the above lemma, the condition number is evaluated at xq (the true signal) rather than at x* 
(the estimator). This will be convenient when considering worst cases over classes of target signals. 

This means that Renegar’s condition number defined in (14) also controls the statistical performance 
of estimators built on solving the approximate recovery problem (16). This at least partially explains the 
common empirical observation (see, e.g., [Donoho and Tsaig, 2008]) that problem instances where statistical 
estimation succeeds are computationally easy to solve. In fact, we will see in what follows that the worst- 
case value of the distance to infeasibility coincides with classical measures of recovery performance, such 
as restricted eigenvalues [Bickel et ah, 2009] in the £i case. 

On paper, the computational complexities of (1) and (16) are very similar (in fact, infeasible start primal- 
dual algorithms designed for solving (1) actually solve problem (16) with 6 small). In our experiments, we 
did observe sometimes significant differences in behavior between the noisy and noiseless case. 


3.2.2. Generalized restricted eigenvalues. We now further specify the sparsity inducing norms in order to 
study Renegar’s condition number on a class of signals that share the same sparsity properties. We use 
again the framework of sparsity structure introduced by Juditsky et al. [Juditsky et ah, 2014], recalled in 
the previous section. Given a sparsity structure and A: > 0, Lemma 3.2 shows that the worst-case distance to 
infeasibility on the class of /c-sparse signals can be written as 


lik{A) ^ 


inf 


inf 


p&r, u{p)=k, z&T{x) 

xGAf, PBx=Bx, 


l^l|2 


(the first infimum covers all signals x of sparsity k), where the tangent cone is defined here as 

T{x) = cone{z G X : \\Bx -|- Bz\\ < ||i?x||}. 

In the following lemma, we show that this worst-case distance to infeasibility is directly related to 
generalized restricted eigenvalues. 

Lemma 3.4. Given a sparsity structure (|| • ||, 'P),/or P £ V, let 

Cp = 'T{x), and Vp = {z £ X, ||Pi?z|| < ||Pi? 2 ;||}. (18) 

{xG^ : PBx=Bx} 

Then, Cp C T)p. Hence, for any k > 0, 


dk{A)>ak{A)= inf 

PGPki zGX , 
\\PBz\\<\\PBz\\ 


Iklb 


Proof. Let P £ V and z £ T{x) fox x £ X such that PBx = Bx. Using [Juditsky et ah, 2014, 
Lem. 3.1], we have 

||i?x|| -I- ||.Pi?z|| — < \\Bx + Bz\\ < ||f?x||. 

So ||.Pi?z|| < ||Pi3z|| and z £ Dp. ■ 


The inverse of the generalized restricted eigenvalue therefore also bounds the worst case computational 
complexity through the condition number. We remark that, in general, one does not have pk{A) = ak{A). 
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A simple counterexample can be derived for the nuclear norm. Indeed, let <5 = X = i? be the identity 

and (II • II, "P) be the nuclear norm and its associated family of linear maps. Let 


g=(j “) and f/=(» ") 

with u ^ 0. Setting P: X —)■ QXQ, so that P G P, we have ||PP|| = ||PP|| = 0, hence U G Vp. Now 
let 

Xp = {X £ X ■. PX = X] = : X G m| . 

For any X G Xp, ||X + P|| = \/P Au^ > |x| = ||X||, hence U ^ VJx&XpT{X), showing that Pp ^ Cp. 

As shown below however, with the additional assumption that the norm is strictly decomposable, that 
is, P = I — P and that B is bijective (non-overlapping groups) the bound in Lemma 3.4 is tight and 

/ifc(A) = o-fc(A). 

Lemma 3.5. Given a sparsity structure (|| • ||, P), assume that for any P£V,P = I — P. Then Cp = T)p 
and, for any k >D, we have 

Pk{A) = (Jk{A). 

Proof. Let P £ V and z £ X, ||PPz|| < ||PP 2 ;||. Let w = —PBz. We have Pw = w and 
||u; + p^|| = \\{I-P)Bz\\ = \\PBz\\ < \\PBz\\ = ||n;||. 

Thus, z £ T{x) for x such that w = Bx, and PBx = Bx implies z £ Cp.m 


In the £i case, our definition for pk{A) 
et al., 2009], with 


(^k{A) 


= <7k{A) matches the definition of restricted eigenvalue in [Bickel 


inf 

Sc[l,p]:Card(5)=fc 
2:eIRP: IksclllGikslll 


Iklb 


This, we believe, makes for an interesting link between the statistical notion of restricted eigenvalue, and 
the computational notion of Renegar condition number. 


4. Numerical Results 

4.1. Sharpness & restart. We test the restart scheme in Algorithm 1 on £i-recovery problems with random 
design matrices. Throughout the experiments, we use the NESTA code described in [Becker et al., 2011a]. 
We generate a random design matrix A £ with i.i.d Gaussian coefficients. We then normalize A so 

that AA^ = I (to fit NESTA’s format) and generate observations b = Ax* where x* G is a fe-sparse 
vector whose nonzero coefficients are all ones. In Eigure 1 we compare the performance of running NESTA 
with and without restart, for various values of the number of inner iteration t and outer iterations r. We 
observe that restart can improve performance but that this improvement can be neutralized if the number of 
outer iterations is set much too high. Here, we have set p = 500, m = 300 and k = 50. To minimize the 
number of moving parameters, we do not use continuation in [Becker et al., 2011a] hence directly implement 
the method in [Nesterov, 2005]. 
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Figure 1. Restarted NESTA (solid blue line), versus NESTA (dotted black line). Left: 

Using 5 restart and 250 max inner iterations. Right: Using 10 restart and 100 max inner 
iterations. 

We also check the result in Proposition 2.8 on toy problems with p = 100, k = 5 and increasing values 
of m, with m = {25,30,50}. As m grows, the matrix A satisfies the nullspace property (NSP) with 
diminishing values of C. The complexity bound in (9) improves as m increases and the recovery problem 
becomes less ill-posed, which is confirmed by fhe numerical experimenfs reported in Eigure 2. 



Eigure 2. Resfarted NESTA versus number of iterations for m = {25, 30, 50}. 

4.2. Renegar’s condition number and compressed sensing performance. We firsl describe how we ap- 

proximafe fhe value of Cxq (A), we fhen defail numerical experimenfs on synfhefic dafa sefs. 

4.2.1. Computing Cxq{A). The condition number Cxo(A) appears here in upper bounds on compufafional 
complexifies and sfafisfical performances. In order fo fesf numerically whefher fhis quanfify fruly explains 
fhose fealures (as opposed fo merely appearing in a wildly pessimistic bound), we explicifly compufe if in 
numerical experimenfs. 

We focus on fhe ii norm. To compufe Cxq{A), we propose a heuristic which computes pxoi^) in (15), 
fhe value of a nonconvex minimizafion problem over fhe cone of descenf direcfions T(xo). The closure of 
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the latter is the polar of the cone generated by the subdifferential to the £i-norm ball at xq [Chandrasekaran 
et ah, 2012, §2.3]. Let S C {1,... ,p} denote the support of xq, S denote its complement, and |5| denote 
the cardinality of S. Then, with s = sign(xo), 

T(xo) = conejz : zs = ss,Zs £ [-1, = {z E : H^gHi < -SgZg = -s'^z} . 

Thus, (^) is the square root of 

min z'^A^Az s.t. Ilzlb = 1 and lU^lli ^ (19) 

Let A denote the largest eigenvalue of A'^A. If it were not for the cone constraint, solutions of this problem 
would be the dominant eigenvectors of XI — A, which suggests a projected power method [Deshpande 
et ah, 2014] as follows. Given an initial guess E W, ||^;(o )||2 = 1. iterate 

2(fc+i) = Proj^g {{XI - A'^A)z(^k)) , Z(fc+i) = z^k+i)/\\z{k+i)\\2, (20) 

where we used the orthogonal projector to T (xg), 

Projxo( 2 ) = argmm llz - 2 ||| s.t. \\zs\\i<-s'^z. ( 21 ) 

This convex, linearly constrained quadratic program is easily solved with CVX Grant et al. [2001]. As can 
be seen from KKT conditions, this iteration is a generalized power iteration Journee et al. [2008] 

Z(k+i) E argmax 2 ;'^(A/— A^A) 2 :(;s) s.t. \\z \\2 < 1 and ||. 2 ; 5 ||i < —s^z. 

Vxom the latter, it follows that ||Az(fc )||2 decreases monotonically with k. Indeed, owing to convexity of 
f{z) = ^z'^{XI — A'^A)z, we have f{z) — f{z{^k)) > {z — Z(j^-^)'^{XI — A^A)z(j^y The next iterate 
z = ^(fc+i) maximizes this lower bound on the improvement. Since z = z^k-^ is admissible, the improvement 
is nonnegative and f{z(^k^) increases monotonically. 

Thus, the sequence ||Az(;i .)||2 converges, but it may do so slowly, and the value it converges to may 
depend on the initial iterate Z(^Qy On both accounts, it helps greatly to choose Z(g) well. To obtain one, 
we modify (19) by smoothly penalizing the inequality constraint in the cost function, which results in a 
smooth optimization problem on the £2 sphere. Specifically, for small ei, £2 > 0, we use smooth proxies 
h{x) = x"^ + ef — £i ^ |x| and q{x) = £ 2 log(l + exp(x/£ 2 )) ~ max(0, x). Then, with 7 > 0 as 

Lagrange multiplier, we consider 

I \\Az\\l + 7-5 [s^z + h{zi)^ . 

We solve the latter locally with Manopt [Boumal et ah, 2014], itself with a uniformly random initial guess 
on the sphere, to obtain Z(^Qy Then, we iterate the projected power method. The value ||A 2;||2 is an upper 
bound on pxQ (^)^ so that we obtain a lower bound on Cxq {A). Empirically, this procedure, which is random 
only through the initial guess on the sphere, consistently returns the same value, up to five digifs of accuracy, 
which suggesfs fhe proposed heuristic compufes a good approximafion of fhe condifion number. Similarly 
posifive resulfs have been reporfed on ofher cones in Deshpande ef al. [2014], where fhe special sfrucfure 
of fhe cone even made if possible fo cerfify fhaf fhis procedure indeed affains a global opfimum in proposed 
experimenfs. Similarly, a generalized power mefhod was recenfly shown fo converge fo global opfimizers 
for fhe phase synchronization problem (in a cerfain noise regime) Boumal [2016]. This gives us confidence 
in fhe esfimafes produced here. 
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4.2.2. Sparse recovery performance. We conduct numerical experiments in the ii case to illustrate the 
connection between the condition number Cxq(A), the computational complexity of solving (1), and the 
statistical efficiency of the estimator (16). Importantly, throughout the experiments, the classical condition 
number of A will remain essentially constant, so that the main variations cannot be attributed to the latter. 

We follow a standard setup, similar to some of the experiments in Donoho and Tsaig [2008]. Fixing the 
ambient dimension p = 300 and sparsity k = Card(xo) = 15, we let the number of linear measurements n 
vary from 1 to 150. For each value of n, we generate a random signal xq € W (uniformly random support, 
i.i.d. Gaussian entries, unit f' 2 -iiorm) and a random sensing matrix A G with i.i.d. standard Gaussian 
entries. Furthermore, for a fixed value 6 = 10“^, we generate a random noise vector w £ wifh i.i.d. 
sfandard Gaussian enfries, normalized such thaf ||u )||2 = <5|| A|| 2 , and we lef y = Axq + w. This is repeated 
100 limes for each value of n. 

For each Iriplel {A, xq, y), we firsl solve fhe noisy problem (16) wifh fhe Ll-Homolopy algorifhm (r = 
10“^) Asif and Romberg [2014], and reporf the estimation error ||x* — xo|| 2 - Then, we solve the noiseless 
problem (1) with Ll-Homotopy and the TFOCS routine for basis pursuit {p = 1) Becker et al. [2011b]. 
Exact recovery is declared when the error is less than 10“®, and we report the empirical probability of exact 
recovery, together with the number of iterations required by each of the solvers. The number of iterations 
of LARS Efron et al. [2004] is also reported, for comparison. Eor Ll-Homotopy, we report the computation 
time, normalized by the computation time required for one least-squares solve in A, as in [Donoho and Tsaig, 
2008, Eig. 3], which accounts for the growth in n. Einally, we compute the classical condition number of 
A, k{A), as well as (a lower bound on) the cone restricted condition number CxoiA), as per the previous 
section. As it is the computational bottleneck of the experiment, it is only computed for 20 of the 100 
repetitions. 

The results of Eigure 3 show that the cone-restricted condition number explains both the computational 
complexity of (1) and the statistical complexity of (16): fewer samples mean bad conditioning which in 
turn implies high computational complexity. We caution that our estimate of Cxq (A) is only a lower bound. 
Indeed, for small n, the third plot on the left shows that, even in the absence of noise, recovery of xq is 
not achieved by (16). Lemma 3.3 then requires CxoiA) to be infinite. But the computational complexity of 
solving (1) is visibly favorable for small n, where far from the phase transition, problem (P) is far from infea¬ 
sibility, which is just as easy to verify as it is to certify that (P) is infeasible when n is comfortably larger than 
needed. This phenomenon is best explained using a symmetric version of the condition number Amelunxen 
and Lotz [2014] (omitted here to simplify computations). 

We also solved problem ( 1 ) with interior point methods (IPM) via CVX. The number of iterations ap¬ 
peared mostly constant throughout the experiments, suggesting that the practical implementation of such 
solvers renders their complexity mostly data agnostic in the present setting. Likewise, the computation time 
required by Ll-Homotopy on the noisy problem (16), normalized by the time of a least-squares solve, is 
mostly constant (at about 150). This hints that the link between computational complexity of (1) and (16) 
remains to be fully explained. 


Appendix 

The last condition in Definition 2.5 is arguably the least intuitive. Lemma 4.1 below connects it, in some 
cases, with the more intuitive notion of decomposable norm. 

Lemma 4.1. Assume B = \, condition Hi) above, which reads 

||p*j P*g\\,, < max(||/||*, ||p||*), 

for any f,g £ S, implies 


nil > ||Pn|| + ||Pn||. 


for any u £ £. 
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Figure 3. We plot the cone-restricted condition number of A (upper left), explaining both 
the computational complexity of problem (1) (right column) and the statistical complexity 
of problem (16) (second on the left). Central curves represent the mean (geometric mean 
in log-scale plots), red curves correspond to 10th and 90th percentile. We observe that 
high computing times (peaks in the right column) are directly aligned with instances where 
sparse recovery barely holds/fails (left), i.e. near the phase transition around n = 70, where 
the distance to feasibility for problem (P) also follows a phase transition. 
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Proof. We combine the conjugacy result for squared norm [Boyd and Vandenberghe, 2004, Example 
3.27] showing that the conjugate of a squared norm ||x|p/2 is the squared conjugate norm ||x||^/2, with the 
result in [Rockafellar, 1970, Th. 16.3], to show 

(IIP*/ + P*9\\1/2), = inf{||u||2/2 :Pu = y,Pu = z} 

U 

Also, the dual of the norm max(||/||*, ||p||*) is the norm ||?/|| + || 2 :||, hence taking the conjugate of condition 
iii) implies 

inf{||u|| : Pu = y, Pu = z} > ||y|| + || 2 ;|| 

U 

or again 

||u|| > llPnll + ||Pu||. 

which is the desired result. ■ 

This means in particular that ||tt|| = ||Pu|| + ||Pu|| when P = I — P, in which case sparsity systems 
match the decomposable norms setting in [Negahban et al., 2009]. The condition P = I — P holds for the £i 
norm and some group sparsity problems detailed below, but not for the nuclear norm. 
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